ABSTRACT

Cyber security is one of the most significant technical challenges of
our time. Detecting adversarial activities and preventing the theft
of intellectual property and customer data are high priorities for
corporations and government agencies around the world. Cyber
defenders need to analyze massive-scale, high-resolution network
flows to identify, categorize, and mitigate attacks involving
networks spanning institutional and national boundaries. Many
cyber attacks can be described as subgraph patterns, with prominent
examples being insider infiltrations (path queries), denial of
service (parallel paths) and malicious spreads (tree queries). This
motivates us to explore subgraph matching on streaming graphs in
a continuous setting. The novelty of our work lies in using the
subgraph distributional statistics collected from the streaming graph
to determine the query processing strategy. We introduce a "Lazy
Search" algorithm where the search strategy is decided on a
vertex-to-vertex basis depending on the likelihood of a match in the
vertex neighborhood. We also propose a metric named "Relative
Selectivity" that is used to select between different query processing
strategies. Our experiments performed on real online news, a network
traffic stream and a synthetic social network benchmark demonstrate
10-100x speedups over selectivity-agnostic approaches.

1. INTRODUCTION 

Social media streams and cyber data sources such as computer
network traffic are prominent examples of high-throughput, dynamic
graphs. Application domains such as cyber security, emergency
response, and national security put a premium on discovering
critical events as soon as they emerge in the data. Thus, processing
streaming updates to a dynamic graph database for real-time
situational awareness is an important research problem. These
data sources are also distinguished by their natural representation
as heterogeneous or multi-relational graphs. For example, a social
media data stream contains a diverse set of entity types such as
person, movie, and image, and relations such as friendship and like.
For cyber security, a network traffic dataset can be modeled as a
graph where vertices represent IP addresses and edges are typed by
classes of network traffic. Our work is focused on continuous
querying of these dynamic, multi-relational graphs.


Figure 1: Graph-based descriptions of attack patterns. a) Insider
infiltration: this pattern shows how an attacker may move laterally
inside an enterprise. b) Denial of service attack. c) Information
exfiltration: the victim browses a compromised website, which
downloads a script that establishes communication with the botnet
command and control. Example edge label from the figure:
e = {protocol: RemoteDesktopConnection, login: adminUser}.


For social networks, we are often inundated with the stream of
updates. Unless we choose to stay constantly connected to the
social networks, it is highly desirable to report only the important
patterns or events as they occur in the data; for example, we may
choose to ask "tell me when two friends are meeting at a nearby
location". The stakes are much higher in the cyber security
domain. As the volume and throughput of network traffic or event log
datasets rise exponentially, the inability to detect adversarial
actions in real time provides an asymmetric advantage to
attackers. Internet backbone traffic collected by CAIDA, which we use
later as a dataset in our experiments, typically accumulates 40 million
packets every minute. In a study titled "Data Breach Investigations
Report", US communications company Verizon analyzed 100,000
security incidents from the past decade and concluded that 90% of
the incidents fell into ten attack patterns. A number of these attacks
can be naturally described as graph patterns. Figure 1 shows
graph-based patterns for a number of these attacks. Organizations such
as internet service providers and content delivery networks that
receive network traffic from a wide area network are ideally poised
to search for these attack patterns. Although there exists a
significant number of graph databases and graph processing frameworks
that scale to billion-edge graphs, none of them support real-time
subgraph pattern matching as a primary feature. Periodic export
of network traffic flow or event alerts from log aggregation tools
to a graph database, followed by post-attack querying on the static
graph database, is the most common workflow today. Despite cyber
security being a multi-billion dollar market worldwide,
research on providing real-time querying capability on a single, large
streaming graph is rather scarce.

Continuous querying of a dynamic graph raises a number of
unique challenges. Indexing techniques that preprocess a graph
and speed up queries are expensive to periodically recompute in a
dynamic setting. Periodic execution of the query is an obvious
solution under this condition, but the effectiveness of this approach
diminishes as the interval between query executions shrinks. Also,
periodic searching of the entire graph can be wasteful where the query
match emerges slowly, because we will find a partial match for the
query every time we search and potentially redo the work numerous
times. Very recent publications by Gao et al. and by Mondal and
Deshpande present algorithms for implementing continuous
queries on graphs. This motivates us to study the problem of
subgraph pattern matching in a streaming setting. We want to register
a pattern as a graph query and continuously perform the query on
the data graph as it evolves over time.

In addition to the cyber attack patterns in Figure 1, we also draw
social queries from LSBench, a benchmark for reasoning on
streaming SPARQL data. A common theme that emerges is that all these
query graphs are heterogeneous in nature. They are composed of
different edge types (in cyber security) as well as different node and
edge types (in social media). None of the previous work on
continuous pattern detection has addressed this issue of heterogeneity.
Exploiting the heterogeneity in both the query graph and the data
graph stream, and improving over heterogeneity-agnostic
continuous pattern detection approaches, is the primary contribution of
our research. The primary ideas behind our approach are described
below. We believe the simplicity of our approach is its greatest
strength, and it will allow easy adoption of our optimizations into
the distributed system implementations developed by others in the
field.



Figure 2: Framework for subgraph pattern matching on 
streaming graphs. 


Figure 2 provides an overview of our approach. We approach the
problem from an incremental processing perspective where search
happens locally on every edge arrival. We do not search for the
entire query graph around every new edge arriving in the stream.
Given a query graph, the query optimizer decomposes it into smaller
subgraphs ordered by their selectivity. The selectivity
information is obtained using the single-edge and 2-edge path
distributions collected from the graph stream (section 5.1). We store the
resulting decomposition in a data structure named SJ-Tree
(Subgraph Join Tree) (section 3.1) that tracks matching subgraphs in the
data graph. For a new edge in the graph, we always search for the
most selective subgraph of the query graph. For other subgraphs
of the query graph, a search is triggered if and only if a match
for the previous subgraph in the selectivity order was obtained in
the neighborhood of the new edge. This algorithm, named "Lazy
Search", is described in section 4. We introduce two metrics,
Expected and Relative Selectivity, that capture the effectiveness of
a given query decomposition. Further, we demonstrate
how these metrics can be used to reason about the performance
of different decompositions and select the best performing
strategy.

1.1 Contributions 

The most important takeaway from our work is that even though the
subgraph isomorphism problem is NP-complete, it is possible to
perform efficient continuous queries on dynamic graphs by
exploiting the heterogeneity in the data and query graphs. More specific
contributions of the paper are listed below.

1. We present a dynamic graph search algorithm that
demonstrates speedups of multiple orders of magnitude with respect
to the state of the art.

2. We introduce two selectivity metrics for query graphs that are 
estimated using efficiently obtainable distributional statistics 
of single edge and 2-edge subgraphs from the graph stream. 

3. We present an automatic query decomposition algorithm that 
selects the best performing strategy using the aforementioned 
graph stream statistics and Relative Selectivity. 

Our observations are supported by experiments on datasets from 
three diverse domains (online news, computer network traffic and 
a social media stream). 

2. BACKGROUND AND RELATED WORK 

This section provides an overview of the related
field and the context for the studied problem. We begin
by introducing the key concepts.

Multi-Relational Graphs We define a graph $G$ as an ordered
pair $G = (V, E)$ where $V$ is the set of vertices and $E$ is
the set of edges that connect the vertices. In the following, we
use $V(G)$ and $E(G)$ to indicate the sets of vertices and edges
associated with a graph $G$. A labeled graph is a six-tuple
$G = (V, E, \Sigma_V, \Sigma_E, \lambda_V, \lambda_E)$, where
$\Sigma_V$ and $\Sigma_E$ are sets of distinct labels for vertices
and edges. $\lambda_V$ and $\lambda_E$ are vertex and edge labeling
functions, i.e. $\lambda_V : V \to \Sigma_V$ and
$\lambda_E : E \to \Sigma_E$.

Dynamic Graphs We define dynamic graphs as graphs that
change over time through edge insertion or deletion. Every edge
in a dynamic graph has a timestamp associated with it and
therefore, for any subgraph $g$ of a dynamic graph we can define a time
interval $\tau(g)$ which is equal to the interval between the earliest and
latest edge belonging to $g$. We focus on directed, labeled dynamic
graphs with multi-edges in this work. The graph is maintained as
a window in time. Given a time window $t_w$, edges are deleted as
they become older than $t_{last} - t_w$, where $t_{last}$ is the timestamp of
the newest edge in the graph.
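
The time-window rule above can be illustrated with a minimal Python sketch. It assumes that edges arrive in non-decreasing timestamp order; the class name `WindowedEdgeStore` and its fields are hypothetical and are not part of the system described in this paper.

```python
from collections import deque


class WindowedEdgeStore:
    """Keeps only edges that are not older than t_last - t_w (illustrative only)."""

    def __init__(self, t_w):
        self.t_w = t_w
        # Each entry is (timestamp, src, dst, label); assumed to arrive in time order.
        self.edges = deque()

    def add_edge(self, timestamp, src, dst, label):
        """Insert an edge and return the list of edges that fell out of the window."""
        self.edges.append((timestamp, src, dst, label))
        # The newest edge defines t_last, so the cutoff is t_last - t_w.
        cutoff = timestamp - self.t_w
        expired = []
        while self.edges and self.edges[0][0] < cutoff:
            expired.append(self.edges.popleft())
        return expired


store = WindowedEdgeStore(t_w=10)
store.add_edge(1, "a", "b", "login")
store.add_edge(5, "b", "c", "login")
expired = store.add_edge(12, "c", "d", "login")  # cutoff is 12 - 10 = 2
```

In this run the edge with timestamp 1 is older than the cutoff and is dropped, while the other two edges stay in the window.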

Subgraph Isomorphism Given the query graph $G_q$ and a matching
subgraph of the data graph ($G_d$) denoted as $G'_d$, a matching
between $G_q$ and $G'_d$ involves finding a bijective function
$f : V(G_q) \to V(G'_d)$ such that for any two vertices
$u_1, u_2 \in V(G_q)$, $(u_1, u_2) \in E(G_q) \Rightarrow
(f(u_1), f(u_2)) \in E(G'_d)$.
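
As a small illustration of this definition, the following Python sketch checks whether a candidate vertex mapping $f$ satisfies the edge-preservation condition. It is a simplification made for this example: vertex and edge labels are ignored, edges are plain (source, destination) pairs, and the function name `is_valid_match` is hypothetical.

```python
def is_valid_match(query_edges, data_edges, f):
    """Return True if f is one-to-one and maps every query edge onto a data edge."""
    # f must not map two query vertices to the same data vertex.
    if len(set(f.values())) != len(f):
        return False
    data = set(data_edges)
    # Every query edge (u1, u2) must have its image (f(u1), f(u2)) in the data graph.
    return all((f[u1], f[u2]) in data for (u1, u2) in query_edges)


query_edges = [("x", "y"), ("y", "z")]
data_edges = [("a", "b"), ("b", "c"), ("c", "a")]
ok = is_valid_match(query_edges, data_edges, {"x": "a", "y": "b", "z": "c"})
bad = is_valid_match(query_edges, data_edges, {"x": "a", "y": "c", "z": "b"})
```

The first mapping is accepted because both query edges land on data edges; the second is rejected because (a, c) is not an edge of the data graph.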

2.1 Problem Statement 

Every edge in a dynamic graph has a timestamp associated with
it and therefore, for any subgraph $g$ of a dynamic graph we can
define a time duration $\tau(g)$ which is equal to the duration between
the earliest and latest edge belonging to $g$. Given a dynamic
multi-relational graph $G_d$, a query graph $G_q$ and a time window $t_w$, we
report whenever a subgraph $g_d$ that is isomorphic to $G_q$ appears
in $G_d$ such that $\tau(g_d) < t_w$. The isomorphic subgraphs are also
referred to as matches in the subsequent discussions. Assume that
$G_d^k$ is the data graph at time step $k$. If $M(G_d^k)$ is the cumulative
set of all matches discovered until time step $k$ and $E_{k+1}$ is the set
of edges that arrive at time step $k+1$, we present an algorithm to
compute a function $f(G_d, G_q, E_{k+1})$ which returns the incremental
set of matches that result from updating $G_d$ with $E_{k+1}$ and is
equal to $M(G_d^{k+1}) - M(G_d^k)$.

2.2 Related Work 

Graph querying techniques have been studied extensively in the
field of pattern recognition over nearly four decades. Two
popular subgraph isomorphism algorithms were developed by Ullman
and Cordella et al. The VF2 algorithm employs a filtering
and verification strategy and outperforms the original algorithm
by Ullman. Over the past decade, the database community has
focused strongly on developing indexing and query optimization
techniques to speed up the searching process. A common theme
of such approaches is to index vertices based on k-hop
neighborhood signatures derived from labels and other properties such as
degrees and centrality [17, 18]. Other major areas of work
involve exploration of subgraph equivalence classes and search
techniques for alternative representations such as similarity search
in a multi-dimensional vector space. Apart from neighborhood-based
signatures, graph sketching is an important area that focuses
on generating different synopses of a graph data set [22].
Development of efficient graph sketching algorithms and their application
to query estimation is expected to gain prominence in the near
future.

Investigation of subgraph isomorphism for dynamic graphs did
not receive much attention until recently. It introduces new
algorithmic challenges because we cannot afford to index a dynamic
graph frequently enough for applications with real-time constraints.
In fact, this is a problem with searches on large static graphs as
well. There are two alternatives in that direction. We can
search for a pattern repeatedly or we can adopt an incremental
approach. The work by Fan et al. presents incremental algorithms
for graph pattern matching. However, their solution to subgraph
isomorphism is based on the repeated search strategy. Chen et
al. proposed a feature structure called the node-neighbor tree
to search multiple graph streams using a vector space approach.
They relax the exact match requirement and require significant
preprocessing on the graph stream. Our work is distinguished by its
focus on temporal queries and handling of partial matches as they
are tracked over time using a novel data structure. From a
data-organization perspective, the SJ-Tree approach has similarities with
the Closure-Tree. However, the Closure-Tree approach assumes
a database of independent graphs and the underlying data is not
dynamic. There are strong parallels between our algorithm and the
very recent work by Sun et al., where they implement a
query-decomposition based algorithm for searching a large static graph
in a distributed environment. Here our work is distinguished by
the focus on continuous queries that involve maintenance of
partial matches as driven by the query decomposition structure, and
optimizations for real-time query processing. Mondal and
Deshpande propose solutions for supporting continuous ego-centric
queries in a dynamic graph. Our work focuses on subgraph
isomorphism, while theirs is primarily focused on aggregate queries. We
view this as complementary to our work, and it affirms our belief
that continuous querying of graphs is an important problem area,
and new algorithms and data structures are required for its
development.

The query pattern matching approach recently proposed by Gao et al.
is most closely related to our work, with some important
distinctions. The authors build a vertex-centric query processing engine
for dynamic graphs on top of Apache Giraph, a distributed
computing framework inspired by the Pregel framework. Their query
decomposition approach is based on identifying optimal sub-DAGs
(directed acyclic graphs) in the query graph. The DAGs are then
traversed to identify source and sink vertices to define message
transition rules in the Giraph framework. Although they address
significant challenges inherent in processing dynamic graphs, their
approach is not suitable for all types of queries. Specifically, queries
that have cyclic communications, such as the infiltration attack query
in Figure 1, cannot be decomposed into DAGs to find exact matches.
Also, in our work we exclusively focus on query graphs with labeled
edges with specific constraints. These are not addressed in the
framework proposed by Gao et al. Our work makes no assumptions about
the query graph structure and will find exact matches even when there
are no apparent sink vertices. Moreover, their focus is on distributed
implementation, while we focus on selectivity-based query
decomposition that can improve performance for heterogeneous graphs.
We show via edge distribution and selectivity plots that real-world
heterogeneous graphs have a strong skew in subgraph selectivity.
The novelty of our work lies in estimating the selectivity of
subgraphs from the graph stream and using the selectivity to determine
the subgraph search strategy.

In summary, we consider these works to pursue two related but
distinct directions that need to be implemented in a scalable
system.

3. A QUERY DECOMPOSITION APPROACH 

We introduce an approach that guides the search process to look
for specific subgraphs of the query graph and follow specific
transitions from small to larger matches. The following are the main
intuitions that drive this approach.

1. Instead of looking for a match with the entire graph or just 
any edge of the query graph, partition the query graph into 
smaller subgraphs and search for them. 

2. Track the matches with individual subgraphs and combine 
them to produce progressively larger matches. 

3. Define a join order in which the individual matching
subgraphs will be combined. Do not look for every possible
way to combine the matching subgraphs.

Figure 3: Illustration of the decomposition of a social query in
SJ-Tree.

Figure 3 shows an illustration of the idea. Although the current
work is completely focused on temporal queries, the graph
decomposition approach is suited for a broader class of applications and
queries. The key aspect here is to search for substructures
without incurring too much cost. Even if some subgraphs of the query
graph are matched in the data, we will not attempt to assemble the
matches together without following the join order.

The query decomposition approach can still suffer from having
to maintain too many partial matches. If a subgraph of the query
graph is highly frequent, we will end up tracking a large number
of partial matches corresponding to that subgraph. Unless we have
quantitative knowledge about how these partial matches transition
into larger matches, we face the risk of tracking a large number of
non-promising matching subgraphs. The "Lazy Search" approach
outlined earlier in the introduction addresses this problem. For any
new edge, we search for a query subgraph if and only if it is the
most selective subgraph in the query or if either of the
vertices of that edge participates in a match with the preceding (query)
subgraph in the join order.

This section introduces the data structures
and algorithms for dynamic graph search. We begin by
introducing the SJ-Tree structure (section 3.1) and then proceed to present
the basic algorithms (Algorithms 1 and 2). The "Lazy
Search"-enhanced version is introduced later in section 4. Automated
generation of the SJ-Tree is covered in section 5.

3.1 Subgraph Join Tree (SJ-Tree) 

We introduce a tree structure called the Subgraph Join Tree
(SJ-Tree). The SJ-Tree defines the decomposition of the query graph into
smaller subgraphs and is responsible for storing the partial matches
to the query. Figure 3 shows the decomposition of an example
query. Each of the rectangular boxes with dotted lines will be
represented as a node in the SJ-Tree. The query subgraphs shown
inside each "box" will be stored as a node property described below.

Definition 3.1.1 An SJ-Tree $T$ is defined as a binary tree
comprised of the node set $N_T$. Each $n \in N_T$ corresponds to a subgraph
of the query graph $G_q$. Let us assume $V_{SG}$ is the set of
corresponding subgraphs and $|V_{SG}| = |N_T|$. Additional properties of the
SJ-Tree are defined below.

Definition 3.1.2 A Match or a Partial Match is defined as a set of
edge pairs. Each edge pair represents a mapping between an edge
in a query graph and its corresponding edge in the data graph.

Definition 3.1.3 Given two graphs $G_1 = (V_1, E_1)$ and
$G_2 = (V_2, E_2)$, the join operation is defined as
$G_3 = G_1 \bowtie G_2$, such that $G_3 = (V_3, E_3)$ where
$V_3 = V_1 \cup V_2$ and $E_3 = E_1 \cup E_2$.
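
The match and join definitions can be made concrete with a short Python sketch. The representation is an assumption made for illustration: a graph is a (vertex set, edge set) pair, an edge is a (source, label, destination) triple, and the query vertex names `p1`, `p2` and `a` are hypothetical.

```python
def join_graphs(g1, g2):
    """Join two graphs given as (vertex_set, edge_set) pairs by taking set unions."""
    (v1, e1), (v2, e2) = g1, g2
    return (v1 | v2, e1 | e2)


# A match as a set of (query edge, data edge) pairs, following the definition above.
match_friend = {(("p1", "friend", "p2"), ("George", "friend", "John"))}
match_likes = {(("p2", "likes", "a"), ("John", "likes", "Santana"))}
match_joined = match_friend | match_likes

# The data subgraphs behind the two matches, and their join.
g1 = ({"George", "John"}, {("George", "friend", "John")})
g2 = ({"John", "Santana"}, {("John", "likes", "Santana")})
g3 = join_graphs(g1, g2)
```

Here `g3` contains the three vertices George, John and Santana and both edges; the shared vertex John appears once because the join is a set union.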


Property 1. The subgraph corresponding to the root of the
SJ-Tree is isomorphic to the query graph. Thus, for $n_r = root(T)$,
$V_{SG}\{n_r\} \cong G_q$.

Property 2. The subgraph corresponding to any internal node
of $T$ is isomorphic to the output of the join operation between the
subgraphs corresponding to its children. If $n_l$ and $n_r$ are the left
and right child of $n$, then
$V_{SG}\{n\} \cong V_{SG}\{n_l\} \bowtie V_{SG}\{n_r\}$.

Therefore, each leaf of the SJ-Tree represents a subgraph that we
want to search for (perform subgraph isomorphism) on the
streaming updates. Internal nodes in the SJ-Tree represent subgraphs
that result from the joining of subgraphs returned by the subgraph
isomorphism operations.

Property 3. Each node in the SJ-Tree maintains a set of matches.
We define a function $matches(n)$ that, for any node $n \in N_T$,
returns a set of subgraphs of the data graph. If $M = matches(n)$,
then $\forall G_m \in M$, $G_m \cong V_{SG}\{n\}$.

Property 4. Each internal node n in the SJ-Tree maintains a 
subgraph, CUT-SUBGRAPH(n) that equals the intersection of the 
query subgraphs of its child nodes. 

For any internal node $n \in N_T$ such that CUT-SUBGRAPH($n$)
$\neq \emptyset$, we also define a projection operator $\Pi$. Assume that
$G_1$ and $G_2$ are isomorphic, $G_1 \cong G_2$. Also define $\Phi_V$ and
$\Phi_E$ as functions that define the bijective mapping between the
vertices and edges of $G_1$ and $G_2$. Consider $g_1$, a subgraph of
$G_1$: $g_1 \subseteq G_1$. Then $g_2 = \Pi(G_2, g_1)$ is a subgraph of
$G_2$ such that $V(g_2) = \Phi_V(V(g_1))$ and $E(g_2) = \Phi_E(E(g_1))$.

Our decision to use a binary tree as opposed to an n-ary tree is
motivated by simplicity and by the lower combinatorial cost
of joining matches from multiple children. With the properties of
the SJ-Tree defined, we are now ready to describe the graph search
algorithm.

3.2 Dynamic Graph Search Algorithm 


Algorithm 1 DYNAMIC-GRAPH-SEARCH(Gd, T, edges)
1: leaf-nodes = GET-LEAF-NODES(T)
2: for all es ∈ edges do
3:     UPDATE-GRAPH(Gd, es)
4:     for all n ∈ leaf-nodes do
5:         gsub = GET-QUERY-SUBGRAPH(T, n)
6:         matches = SUBGRAPH-ISO(Gd, gsub, es)
7:         if matches ≠ ∅ then
8:             for all m ∈ matches do
9:                 UPDATE-SJ-TREE(T, n, m)


We begin by describing our dynamic graph search algorithm
(Algorithms 1 and 2). The input to DYNAMIC-GRAPH-SEARCH
is the dynamic graph so far $G_d$, the SJ-Tree ($T$) corresponding to
the query graph, and the set of incoming edges. Every incoming
edge is first added to the graph (Algorithm 1, line 3). Next, we
iterate over all the query subgraphs to search for matches
containing the new edge (lines 5-6). Any discovered match is added to the
SJ-Tree (line 9).

Next, we describe the UPDATE-SJ-TREE function. Each node
in the SJ-Tree maintains its sibling and parent node information
(Algorithm 2, lines 1-2). Also, each node in the SJ-Tree maintains
a hash table (referred to by the match-tables property in Algorithm 2,
line 4). GET() and ADD() provide lookup and update operations
on the hash tables. Each entry in the hash table refers to a Match.
Whenever a new matching subgraph $g$ is added to a node in the
SJ-Tree, we compute a key using its projection ($\Pi(g)$) and insert the
key and the matching subgraph into the corresponding hash table
(line 12). When a new match is inserted into a leaf node, we check
whether it can be combined (referred to as JOIN()) with any matches
that are contained in the collection maintained at its sibling node.
A successful combination of matching subgraphs between the leaf
and its sibling node leads to the insertion of a larger match at the
parent node. This process is repeated recursively (line 11) as long
as larger matching subgraphs can be produced by moving up in the
SJ-Tree. A complete match is found when two matches belonging
to the children of the root node are combined successfully.

Example Let us revisit Figure 3 for an example. Assuming we
find a match with the query subgraph containing a single "friend"
edge (e.g. {("George", "friend", "John")}), we will probe the hash
table in the leaf node with "likes" edges. If the hash table stored
a subgraph such as {("John", "likes", "Santana")}, the JOIN() will
produce a 2-edge subgraph {("George", "friend", "John"), ("John",
"likes", "Santana")}. Next, it will be inserted into the parent node
with 2 edges. The same process will be subsequently repeated,
beginning with the probing of the hash table storing matches with
subgraphs with a "follows" edge.


Algorithm 2 UPDATE-SJ-TREE(node, m)
1: sibling = sibling[node]
2: parent = parent[node]
3: k = GET-JOIN-KEY(CUT-SUBGRAPH[parent], m)
4: Hs = match-tables[sibling]
5: Ms = GET(Hs, k)
6: for all ms ∈ Ms do
7:     msup = JOIN(ms, m)
8:     if parent = root then
9:         PRINT('MATCH FOUND : ', msup)
10:    else
11:        UPDATE-SJ-TREE(parent, msup)
12: ADD(match-tables[node], k, m)
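
The hash-join behavior of UPDATE-SJ-TREE can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: a match is represented as a mapping from query vertices to data vertices rather than as a set of edge pairs, the join key is the tuple of data vertices bound to the cut vertices of the parent, and the three-leaf query (`p1` friend `p2`, `p2` likes `a`, `p1` follows `a`) is a hypothetical stand-in for the query in Figure 3.

```python
class Node:
    """An SJ-Tree node holding a hash table of matches (illustrative only)."""

    def __init__(self, name, cut_vertices=()):
        self.name = name
        self.parent = None
        self.sibling = None
        self.cut_vertices = tuple(cut_vertices)  # query vertices shared by both children
        self.match_table = {}                    # join key -> list of matches


def make_parent(name, left, right, cut_vertices):
    parent = Node(name, cut_vertices)
    left.parent = right.parent = parent
    left.sibling, right.sibling = right, left
    return parent


def join(m1, m2):
    """Merge two vertex mappings; return None if they conflict or are not one-to-one."""
    merged = dict(m1)
    for q, d in m2.items():
        if merged.get(q, d) != d:
            return None
        merged[q] = d
    if len(set(merged.values())) != len(merged):
        return None
    return merged


def update_sj_tree(node, m, results):
    parent, sibling = node.parent, node.sibling
    key = tuple(m[q] for q in parent.cut_vertices)
    # Probe the sibling's hash table for join candidates with the same key.
    for ms in sibling.match_table.get(key, []):
        m_sup = join(ms, m)
        if m_sup is None:
            continue
        if parent.parent is None:       # parent is the root: complete match
            results.append(m_sup)
        else:
            update_sj_tree(parent, m_sup, results)
    # Store the new match so that later arrivals at the sibling can join with it.
    node.match_table.setdefault(key, []).append(m)


friend, likes, follows = Node("friend"), Node("likes"), Node("follows")
inner = make_parent("friend+likes", friend, likes, cut_vertices=("p2",))
root = make_parent("root", inner, follows, cut_vertices=("p1", "a"))

results = []
update_sj_tree(likes, {"p2": "John", "a": "Santana"}, results)
update_sj_tree(friend, {"p1": "George", "p2": "John"}, results)
update_sj_tree(follows, {"p1": "George", "a": "Santana"}, results)
```

After the third call, `results` holds one complete match binding `p1`, `p2` and `a` to George, John and Santana. Keying the hash tables on the cut vertices means that each insertion only inspects sibling matches that can actually join.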


4. LAZY SEARCH 

Revisiting our example from Figure 3, it is reasonable to assume
that the "friend" relation is highly frequent in the data. If we
decomposed the query graph all the way to single edges, then we would
be tracking all edges that match "friend". Clearly, this is
wasteful. One may suggest decomposing the query into larger subgraphs.
However, this will also increase the average time incurred in
performing subgraph isomorphism. Deciding the right granularity of
decomposition requires significant knowledge about the dynamic
graph. This motivates us to introduce a new algorithmic extension.

Assume the query graph $G_q$ is partitioned into two subgraphs
$g_1$ and $G_q^1$. We use the notation $G_q^k$ to indicate what remains of
$G_q$ after the $k$-th iteration in the decomposition process. If the
probability of finding a match for $g_1$ is less than the probability
of finding a match for $G_q^1$, then it is always desirable to search
for $g_1$ and look for $G_q^1$ only where an occurrence of $g_1$ is found.
Therefore, we select $g_1$ to be the most selective edge or 2-edge
subgraph in the query graph and always search for $g_1$ around every
new edge in the graph. Once we detect subgraphs in $G_d$ that match
$g_1$, we follow the same approach to search for $G_q^1$ in their
neighborhood. We partition $G_q^1$ further into two subgraphs: $g_2$ and
$G_q^2$, where $g_2$ is another 1-edge or 2-edge subgraph.

Data Structures With the SJ-Tree, the partitioning of $G_q$ is
done upfront at query compile time, with $g_1$, $g_2$, etc. becoming
the leaves of the tree. The main difference between Lazy Search
and the approach of Algorithm 2 is that we search for $g_2$ only
around the edges in $G_d$ where a match with $g_1$ is found. Therefore,
for every vertex $u$ in $G_d$, we need to keep track of the $g_i$-s
such that $u$ is present in the matching subgraph for $g_i$. We use a bitmap
structure $M_b$ to maintain this information. Each row in the bitmap
refers to a vertex in $G_d$ and the $i$-th column refers to $g_i$, or the
$i$-th leaf in the SJ-Tree. If the search for subgraph $g_i$ is enabled for
vertex $u$ in $G_d$, then $M_b[u][i] = 1$, and zero otherwise. Whenever
a matching subgraph $g'$ for $g_i$ is discovered, we turn on the search
for $g_{i+1}$ for all vertices in $V(g')$. This is accomplished by setting
$M_b[v][i+1] = 1$ where $v \in V(g')$.
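
The bitmap bookkeeping can be sketched in a few lines of Python. This is only an illustration under stated assumptions: leaves are indexed from 0 rather than 1, each row is stored as an integer bit mask in a dictionary, and the class name `SearchBitmap` and its methods are hypothetical.

```python
class SearchBitmap:
    """Per-vertex flags: bit i set means the search for leaf i is enabled."""

    def __init__(self, num_leaves):
        self.num_leaves = num_leaves
        self.rows = {}  # vertex -> integer bit mask

    def enabled(self, vertex, i):
        # Leaf 0 (the most selective subgraph) is enabled everywhere by default.
        row = self.rows.get(vertex, 1)
        return bool((row >> i) & 1)

    def enable_next(self, match_vertices, i):
        """After a match for leaf i, turn on the search for leaf i + 1."""
        if i + 1 >= self.num_leaves:
            return
        for v in match_vertices:
            self.rows[v] = self.rows.get(v, 1) | (1 << (i + 1))


bitmap = SearchBitmap(num_leaves=3)
before = bitmap.enabled("u", 1)       # search for leaf 1 is off initially
bitmap.enable_next(["u", "v"], 0)     # a match for leaf 0 was found on vertices u and v
after = bitmap.enabled("u", 1)
```

Only vertices that took part in a match for leaf 0 have the search for leaf 1 switched on; every other vertex keeps the default row.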

Robustness with Subgraph Arrival Order Consider an
SJ-Tree with just two leaves representing query subgraphs $g_1$ and
$g_2$, with $g_1$ representing the more selective left leaf. The above
strategy is not robust to the arrival order of matches. Assume $g'_1$
and $g'_2$ are subgraphs of $G_d$ that are isomorphic to $g_1$ and $g_2$
respectively. Together, $g'_1 \bowtie g'_2$ is isomorphic to the query
graph $G_q$. Because we are searching for $g_1$ on every incoming edge,
$g'_1$ will be detected as soon as it appears in the data graph. However,
we will detect $g'_2$ only if it appears in $G_d$ after $g'_1$. If $g'_2$
appeared in $G_d$ before $g'_1$, we will not find it because we are not
searching for $g_2$ all the time.

We introduce a small change to address this temporal ordering
issue. Whenever we enable the search on a node in the data graph,
we also perform a subgraph search around the node to find any
match that has occurred earlier. Thus, when we find $g'_1$ and enable
the search for $g_2$ on every subsequent edge arrival, we also perform
a search in $G_d$ looking for $g'_2$. This ensures that we will find $g'_2$
even if it appeared before $g'_1$.

Algorithm 3 summarizes the entire process. Lines 2-3 loop over
all new edges arriving in the graph and update the graph. Next,
given a new edge $e_s$, for each leaf node in the SJ-Tree, we check to see
if we should be searching for its corresponding subgraph around
$e_s$ (lines 4-8). The DISABLED() function queries the bitmap index
and returns true if the corresponding search task is disabled.
GET-QUERY-SUBGRAPH returns the query subgraph $g_{sub}$
corresponding to node $n$ in the SJ-Tree (line 9). Next, we search for
$g_{sub}$ using a subgraph isomorphism routine that only searches for
matches containing at least one of the end-point vertices of $e_s$ ($u$
and $v$, obtained in lines 5-6). For each matching subgraph found
containing $u$ or $v$, we enable the search for the query subgraph
corresponding to the sibling of $n$ in the SJ-Tree. If $n$ is not the
left-most leaf in the SJ-Tree, then we also query the left sibling
to probe for potential join candidates (QUERY-SIBLING-JOIN(),
line 15). Any resultant joins are pushed into the parent node and
the entire process is recursively repeated one level higher in the
SJ-Tree.
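
To make the control flow concrete, here is a heavily simplified Python sketch of the lazy strategy, including the retroactive search described above. It is not the LAZY-SEARCH algorithm itself: it hard-codes a query with two single-edge leaves forming a path x -rare-> y -common-> z, assumes the two labels differ, and the function name `lazy_search` and the labels `rdp` and `dns` are hypothetical.

```python
from collections import defaultdict


def lazy_search(stream, rare_label, common_label):
    """Report paths x -rare-> y -common-> z; also count how often g2 was searched."""
    out_edges = defaultdict(list)    # vertex -> [(label, dst)] seen so far
    enabled = set()                  # vertices where the search for g2 is switched on
    g1_matches = defaultdict(list)   # y -> [x] for stored matches of g1
    results, g2_searches = [], 0
    for src, label, dst in stream:
        out_edges[src].append((label, dst))
        if label == rare_label:
            # g1 is the most selective leaf, so it is searched on every edge.
            g1_matches[dst].append(src)
            enabled.update((src, dst))
            # Retroactive search: g2 edges that arrived before this g1 match.
            for lbl, z in out_edges[dst]:
                if lbl == common_label and len({src, dst, z}) == 3:
                    results.append((src, dst, z))
        elif label == common_label and (src in enabled or dst in enabled):
            # g2 is searched only around vertices touched by a g1 match.
            g2_searches += 1
            for x in g1_matches[src]:
                if len({x, src, dst}) == 3:
                    results.append((x, src, dst))
    return results, g2_searches


stream = [
    ("b", "dns", "c"),   # common edge that arrives before any g1 match
    ("u", "dns", "w"),   # unrelated common edge, never searched
    ("a", "rdp", "b"),   # g1 match; the retroactive search finds b -> c
    ("b", "dns", "d"),   # searched because b is enabled
]
results, g2_searches = lazy_search(stream, "rdp", "dns")
```

Of the three `dns` edges only one triggers a search for the second leaf, yet both paths through `b` are reported, including the one whose `dns` edge arrived first.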

5. SJ-TREE GENERATION 

Here we address the topic of automatic generation of the SJ-Tree
from a specified query graph. We begin by introducing key
definitions, followed by the decomposition algorithm.

DEFINITION Subgraph Selectivity Given a large typed, directed
graph $G$, the selectivity of a typed, directed subgraph $g$ with $k$
edges (denoted as $S(g)$) is the ratio of the number of occurrences
of $g$ to the total number of all $k$-edge subgraphs in $G$. Instances
of $g$ may overlap with each other.

DEFINITION Selectivity Distribution The selectivity
distribution of a set of subgraphs $G_k$ is a vector containing the
selectivity for every subgraph in $G_k$. The subgraphs are ordered by their
frequencies in ascending order.
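
The two definitions can be illustrated with a small Python sketch that counts single-edge types and 2-edge paths in a list of edges. It makes simplifying assumptions that go beyond the text: edges are (source, label, destination) triples, vertex types are ignored, and only directed 2-edge paths a -l1-> b -l2-> c over three distinct vertices are counted as 2-edge subgraphs. All function names are hypothetical.

```python
from collections import Counter


def edge_selectivity(edges):
    """Selectivity of each edge type: its count divided by the number of edges."""
    counts = Counter(label for _, label, _ in edges)
    total = sum(counts.values())
    return {label: c / total for label, c in counts.items()}


def two_edge_path_selectivity(edges):
    """Selectivity of each path type (l1, l2) among all directed 2-edge paths."""
    out = {}
    for src, label, dst in edges:
        out.setdefault(src, []).append((label, dst))
    counts = Counter()
    for a, l1, b in edges:
        for l2, c in out.get(b, []):
            if len({a, b, c}) == 3:
                counts[(l1, l2)] += 1
    total = sum(counts.values())
    return {path: c / total for path, c in counts.items()} if total else {}


def selectivity_distribution(selectivity):
    """Subgraph types ordered by ascending frequency."""
    return sorted(selectivity.items(), key=lambda kv: kv[1])


edges = [
    ("a", "friend", "b"),
    ("b", "friend", "c"),
    ("c", "likes", "p"),
    ("a", "friend", "c"),
]
edge_sel = edge_selectivity(edges)            # friend: 3/4, likes: 1/4
path_sel = two_edge_path_selectivity(edges)   # (friend, friend): 1/3, (friend, likes): 2/3
distribution = selectivity_distribution(edge_sel)
```

The rarest type comes first in `distribution`, which is the ordering a decomposition would consult when picking the most selective primitive.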

Algorithm 3 LAZY-SEARCH(Gd, T, edges)
1: leaf-nodes = GET-LEAF-NODES(T)
2: for all es ∈ edges do
3:     UPDATE-GRAPH(Gd, es)
4:     for all n ∈ leaf-nodes do
5:         u = src(es)
6:         v = dst(es)
7:         if DISABLED(u, n) AND DISABLED(v, n) then
8:             continue
9:         gsub = GET-QUERY-SUBGRAPH(T, n)
10:        matches = SUBGRAPH-ISO(Gd, gsub, es)
11:        for all m ∈ matches do
12:            if n = 0 then
13:                ENABLE-SEARCH-SIBLING(n, m)
14:            else
15:                Mj = QUERY-SIBLING-JOIN(n, m)
16:                p = PARENT(n)
17:                for all mj ∈ Mj do
18:                    UPDATE(p, mj)
19:                    ENABLE-SEARCH-SIBLING(p, m)


We present a greedy algorithm (Algorithm 4) for decomposing
a query graph into its subgraphs and generating an SJ-Tree. Our
choice of the greedy heuristic is motivated by an extensive survey
of the literature on optimal join order determination in relational
databases [10, 14]. A key conclusion of the survey is that
left-deep join plans (or left-deep binary trees in this case) are one
of the best performing heuristics. The above mentioned studies
point to a large body of research using techniques such as dynamic
programming and genetic algorithms to find the optimal join
order. Nonetheless, finding the lowest cost join order or using a
cost-driven join order determination remains an interesting problem in
graph databases, and approaches based on minimum spanning
trees or approximate vertex cover can provide an initial path
forward.

Inputs to Algorithm 4 are the query graph Gq and an ordered set
of primitives M. Our goal is to decompose Gq into a collection
of (possibly repeated) subgraphs chosen from M. Entries of M
are sorted in ascending order of their subgraph selectivity. Given
a query graph Gq, the algorithm begins by finding the subgraph
with the lowest selectivity in M. This subgraph is then removed
from the query graph, and the nodes of the removed subgraph are
pushed into a “frontier” set. We proceed by searching for the next
most selective subgraph that includes at least one node from the
frontier set, and continue this process until the query graph is empty.
SUBGRAPH-ISO performs a subgraph isomorphism operation to
find an instance of gM in Gq. Algorithm 4 uses two versions of
SUBGRAPH-ISO. The first version takes three arguments, where
the second argument is a vertex id v. This version searches Gq
for instances of gM only in the neighborhood of v. The other
version, which takes two arguments, searches the entire Gq for an
instance of gM. REMOVE-SUBGRAPH accepts two graphs as
arguments, where the second argument (gsub) is a subgraph of the
first (Gq). It removes all edges in Gq that belong to gsub. A vertex
is removed from Gq only when the edge removal leaves it
disconnected.
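
The greedy procedure described above can be sketched in Python as follows. This is a minimal illustration under simplifying assumptions, not the authors' implementation: query edges are (source, type, target) triples, a primitive is a tuple of one or two edge types, a 2-edge primitive matches any two distinct query edges of those types that share a vertex (direction and vertex types are ignored), and SUBGRAPH-ISO is replaced by this simple type matching. The names find_instance and greedy_decompose are hypothetical, and the query is assumed to be connected with distinct edges.

```python
from itertools import combinations


def find_instance(edges, primitive, frontier):
    """Return query edges matching `primitive`, or None (stand-in for SUBGRAPH-ISO).

    If `frontier` is non-empty, the instance must touch a frontier vertex.
    """
    def touches(instance):
        return not frontier or any(
            u in frontier or v in frontier for u, _, v in instance
        )

    if len(primitive) == 1:
        candidates = ([e] for e in edges if e[1] == primitive[0])
    else:
        candidates = (
            [a, b]
            for a, b in combinations(edges, 2)
            if sorted((a[1], b[1])) == sorted(primitive)
            and {a[0], a[2]} & {b[0], b[2]}  # the two edges share a vertex
        )
    for instance in candidates:
        if touches(instance):
            return instance
    return None


def greedy_decompose(query_edges, primitives):
    """Split a query into leaf subgraphs, rarest primitive first.

    `primitives` is a list of (primitive, selectivity) pairs.
    """
    remaining = list(query_edges)
    # Ascending selectivity: the rarest primitive is tried first.
    ordered = [p for p, _ in sorted(primitives, key=lambda ps: ps[1])]
    frontier = set()
    leaves = []
    while remaining:
        for primitive in ordered:
            instance = find_instance(remaining, primitive, frontier)
            if instance is not None:
                break
        else:
            raise ValueError("no primitive covers the remaining query edges")
        leaves.append(instance)
        for u, _, v in instance:
            frontier.update((u, v))
        # Remove the matched edges from the query graph.
        remaining = [e for e in remaining if e not in instance]
    return leaves
```

The order of the returned leaves is the order in which they would be attached to a left-deep SJ-Tree, so the rarest primitive ends up as the first leaf.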

5.1 Selectivity Estimation of Primitives 

We propose computing the selectivity distribution of primitives
by processing an initial set of edges from the graph stream. For
experimentation purposes we assume that the selectivity order
remains the same for the dynamic graph when we perform the query
processing. This work does not focus on modeling the accuracy of
this estimation. Modeling the impact on performance when the
actual selectivity order deviates from the estimated selectivity order
is an area of ongoing work.

Algorithm 4 BUILD-SJ-TREE(Gq, M)
 1: frontier = ∅
 2: while |V(Gq)| > 0 do
 3:     gsub = ∅
 4:     for all gM ∈ M do
 5:         if frontier ≠ ∅ then
 6:             for all v ∈ frontier do
 7:                 gsub = SUBGRAPH-ISO(Gq, v, gM)
 8:                 break
 9:         else
10:             gsub = SUBGRAPH-ISO(Gq, gM)
11:         if gsub ≠ ∅ then
12:             frontier = frontier ∪ V(gsub)
13:             Gq = REMOVE-SUBGRAPH(Gq, gsub)

Which subgraphs are good candidates as entries of M? The
following are two desirable properties for entries in M: 1) the cost
of subgraph isomorphism should be low, and 2) selectivity estimation
of these subgraphs should be efficient, as we will need to
periodically recompute the estimates from a graph stream. Based on these
two criteria, we select single-edge subgraphs and 2-edge paths as
primitives in this study. Computing the selectivity distribution for
single-edge subgraphs reduces to computing a histogram of the
various edge types. The selectivity distribution for 2-edge paths on a
graph with V nodes, E edges and k unique edge types can be
computed in O(V(E + k^2)) time. Algorithm 5 provides a simple
algorithm to count all 2-edge paths. In our experiments, computing
the path statistics for a network traffic dataset with 800K nodes and
nearly 130 million edges takes about 50 seconds without any code
optimization.

Algorithm 5 uses a Counter() data structure, which is a hash
table where, given a key, the corresponding value indicates the
number of times the key occurred in the data. A Counter() is updated
via the UPDATE routine, which accepts the counter object, a key
and an integer by which to increment the corresponding key count. We
iterate over all vertices in the input graph Gd (line 2). For a
given vertex v, we count the number of occurrences of each unique
edge type associated with it (accounting for edge directions). Line
8 iterates over all unique edge types associated with v. Next, given
an edge type e1 and its count n1, we count the number of
combinations possible with two edges of the same type, n1(n1 - 1)/2. Next,
we compute the number of 2-edge paths that can be generated with e1
and any other edge type e2. We impose the LEXICALLY-GREATER
constraint to ensure that each pair of edge types is counted only once
in the 2-edge path distribution.
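
The counting procedure described above can be sketched in Python with collections.Counter. This is a minimal sketch under stated assumptions: the graph is given as a dictionary from each vertex to the list of its incident edges, and the edge_type callable plays the role of Map(), so any direction information must be encoded in the type it returns. The function name count_two_edge_paths is hypothetical.

```python
from collections import Counter


def count_two_edge_paths(adjacency, edge_type=lambda e: e):
    """Count 2-edge paths per (type, type) pair, centered at each vertex."""
    paths = Counter()
    for incident_edges in adjacency.values():
        # Histogram of edge types around this vertex.
        per_type = Counter(edge_type(e) for e in incident_edges)
        types = sorted(per_type)
        for i, t1 in enumerate(types):
            n1 = per_type[t1]
            same_type_pairs = n1 * (n1 - 1) // 2  # "n1 choose 2"
            if same_type_pairs:
                paths[(t1, t1)] += same_type_pairs
            # Only lexically greater types, so each pair is counted once.
            for t2 in types[i + 1:]:
                paths[(t1, t2)] += n1 * per_type[t2]
    return paths
```

Dividing each count by the sum of all counts gives the selectivity distribution of the 2-edge primitives.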

Note that we use a Map() function instead of simply using the
type associated with every edge. Most of our target applications
have a significant number of edge attributes in their graphs. As an
example, in a network traffic graph we use the protocol information
to determine the edge property. Thus, all network flows with the
same protocol (e.g. HTTP, ICMP) are mapped to the same
edge type. Each flow is accompanied by multiple attributes such
as source and destination ports, duration of communication, etc.
Therefore, we can provide a hash function to map any user-defined
edge properties to an integer value. Thus, for queries with
constraints on vertex and edge properties, a generic map function
factors in both structural and semantic characteristics of the graph
stream.
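
One way such a map function could look is sketched below. It is an illustration only: edges are assumed to be dictionaries of attributes, and the helper name make_edge_type_map and the attribute names are hypothetical. The function assigns a small integer id to each distinct combination of the chosen attributes.

```python
def make_edge_type_map(attributes):
    """Return a function mapping an edge-attribute dict to an integer edge type."""
    ids = {}

    def edge_type(edge):
        key = tuple(edge.get(name) for name in attributes)
        # A new attribute combination receives the next unused integer id.
        return ids.setdefault(key, len(ids))

    return edge_type
```

Mapping on the protocol alone reproduces the behavior described above, while adding more attributes (for example the destination port) yields a finer-grained set of edge types.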

Counting the frequency of larger subgraphs is important. Given
a query graph with M edges, ideally we would like to know the
frequency of all subgraphs with size 1, 2, ..., M - 1. Collecting the
frequency of larger subgraphs, specifically triangles, has received
significant attention in the database and data mining community.
Exhaustive enumeration of all the triangles can be expensive,
especially in the presence of high degree vertices in the data.
Approximate triangle counting via sampling for streaming and
semi-streaming scenarios has been extensively studied in recent years (TT).
We foresee incorporating such algorithms to support better query
optimization for queries with triangles.


Algorithm 5 COUNT-2-EDGE-PATHS(Gd)
 1: P = Counter()
 2: for all v ∈ V(Gd) do
 3:     Cv = Counter()
 4:     for all e ∈ Neighbors(Gd, v) do
 5:         et = Map(e)
 6:         Update(Cv, et, 1)
 7:     Et = Keys(Cv)
 8:     for all e1 ∈ Et do
 9:         n1 = Count(Cv, e1)
10:         key = (e1, e1)
11:         Update(P, key, n1(n1 - 1)/2)
12:         for all e2 ∈ LEXICALLY-GREATER(Et, e1) do
13:             n2 = Count(Cv, e2)
14:             key = (e1, e2)
15:             Update(P, key, n1 n2)


5.2 Query Decomposition Strategies 

Algorithm 4 shows that we can generate multiple SJ-Trees for
the same Gq by selecting different primitive sets for M. We can
initialize M with only 1-edge subgraphs, only 2-edge subgraphs or a
mix of both. As an example, for a 4-edge query graph, the removal
of the first 2-edge subgraph can leave us with 2 isolated edges in
Gq. At that stage, we will create two leaf nodes in the SJ-Tree
with 1-edge subgraphs. For brevity we refer to both the second and
third choice as 2-edge decomposition in the remaining discussion.
Clearly, these 1-edge and 2-edge based decomposition strategies have
different performance implications. Searching for 1-edge subgraphs
is extremely fast. However, we stand to pay the price in memory
usage if these 1-edge subgraphs are highly frequent. In contrast,
we expect 2-edge subgraphs to be more discriminative.
Thus, we lower the memory usage at the cost of spending
more time searching for larger, discriminative subgraphs on every
incoming edge.

DEFINITION Expected Selectivity We introduce a metric called
Expected Selectivity, denoted as S(Tk). Given an SJ-Tree Tk, the
Expected Selectivity is defined as the product of the selectivities of
the leaf-level query subgraphs.

leaves(Tk) returns the set of leaves in an SJ-Tree Tk. Given a
node n, V_SG(T, n) returns the subgraph corresponding to node n
in SJ-Tree T. Finally, S(g) is the selectivity of the subgraph g as
defined earlier.

\[ S(T_k) = \prod_{n \in \mathrm{leaves}(T_k)} S\big(V_{SG}(T_k, n)\big) \qquad (1) \]

DEFINITION Relative Selectivity We introduce a metric called
Relative Selectivity, denoted as ξ(Tk, T1). Given a 1-edge
decomposition T1 and another decomposition Tk, we define ξ(Tk, T1) as
follows.

\[ \xi(T_k, T_1) = \frac{S(T_k)}{S(T_1)} \qquad (2) \]
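
The two metrics can be illustrated with a short Python sketch. It assumes that Relative Selectivity is the ratio of the Expected Selectivity of Tk to that of the 1-edge decomposition T1, that each SJ-Tree is represented only by the list of its leaf subgraphs, and that selectivities are looked up in a dictionary. The function names and the dictionary layout are hypothetical.

```python
from math import prod


def expected_selectivity(leaf_subgraphs, selectivity):
    """Product of the selectivities of the leaf-level query subgraphs."""
    return prod(selectivity[g] for g in leaf_subgraphs)


def relative_selectivity(leaves_k, leaves_1, selectivity):
    """Expected Selectivity of T_k relative to the 1-edge decomposition T_1.

    Assumption: the metric is the ratio S(T_k) / S(T_1).
    """
    return (expected_selectivity(leaves_k, selectivity)
            / expected_selectivity(leaves_1, selectivity))
```

A small ratio indicates that the leaves of Tk are far rarer than the single-edge leaves of T1, which is the situation in which a path-based decomposition is expected to pay off.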

We conclude the section with a discussion of two desirable
properties of a greedy SJ-Tree generation strategy.


Theorem 1 Given the data graph Gd at any time t, assume that
the query graph Gq is not guaranteed to be present in Gd. Then
initiating the search for Gq by searching for g_rare, where g_rare ⊆ Gq
and, for all g ⊆ Gq with |E(g)| = |E(g_rare)|,
frequency(g) > frequency(g_rare), is an optimal strategy.

Proof The time complexity of the search is O(1) for a 1-edge
subgraph and O(d_v) for a 2-edge subgraph. Therefore, the
runtime cost of searching for g_rare is the same as for any other
subgraph of Gq with the same number of edges. However, searching for
g_rare requires the minimum space because it has the minimum frequency
among all subgraphs of the same size. Therefore, searching for g_rare
is an optimal strategy.



Figure 4: Example SJ-Tree used in proof of theorem 2. 

Theorem 2 Given a set of identical size subgraphs {g_k} such
that ∪_k g_k = Gq, an SJ-Tree with ordered leaves g_k ⋈ g_{k+1} ⋈
g_{k+2} requires minimal space when frequency(g_k ⋈ g_{k+1}) <
frequency(g_{k+2}).

Proof By induction. Assume an SJ-Tree with three leaves as
shown in Figure 4. Following the definition of the SJ-Tree, this is a
left-deep binary tree with 3 leaves. Therefore, the frequency of c,
denoted in shorthand as f(c), is f(c) = min(f(a), f(b)). Substituting
for the frequency of c, the space requirement for this tree is S(T) =
f(a) + f(b) + f(d) + min(f(a), f(b)). Thus, the space requirement
for this tree is minimum if f(a) < f(b) < f(c).

Now we can consider any arbitrary tree where T_n refers to a tree
with a left subtree T_{n-1} and a right child l_{n+2}. The above shows
that T_1 constructed as above will have the minimum space requirement,
and so will T_2 if f(a) < f(b) < f(c) < f(d).
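
The effect of leaf ordering can be checked numerically with a small Python sketch. It follows the approximation used in the proof, namely that the frequency of an internal node is the minimum of the frequencies of its children, and, matching the three-leaf expression S(T) above, it does not count the root. The function name left_deep_space and the sample frequencies are hypothetical.

```python
def left_deep_space(leaf_frequencies):
    """Estimated storage of a left-deep SJ-Tree with leaves joined in the given order.

    Leaves contribute their own frequency; each internal node below the root
    contributes the minimum frequency of the leaves joined so far.
    """
    total = sum(leaf_frequencies)
    running_min = leaf_frequencies[0]
    # The last leaf is joined at the root, which is not stored here.
    for frequency in leaf_frequencies[1:-1]:
        running_min = min(running_min, frequency)
        total += running_min
    return total
```

For leaf frequencies 5, 10 and 100, joining the two rarest leaves first gives 5 + 10 + 100 + min(5, 10) = 120, whereas starting with the leaves of frequency 10 and 100 gives 125.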

Observation 3 Given gk, a subgraph of query graph Gq, it is
efficient to decompose gk if there is a subgraph g ∈ gk such that
frequency(g) > ..., where d is the average vertex
degree of the data graph and |V(gk)| is the number of vertices in
gk.

Proof Given a graph g, the average cost for searching for an¬ 
other graph that is larger by a single edge is d multiplied by the 
number of vertices in gk, and the proof follows. 

Space Complexity The space complexity of the SJ-Tree can be
measured in terms of the storage required by each leaf in the tree.
The storage for any node in the tree is approximated by the product
of the corresponding subgraph size (measured as the number of
edges) and its frequency. Therefore, the space complexity of the
SJ-Tree is $S(T) = \sum_{k} |E(g_k)| \, \mathrm{frequency}(g_k)$. Given two
subgraphs g_small and g_big, where g_big contains g_small, the frequency
of g_small serves as an upper bound for that of g_big, assuming no
overlapping edges. Therefore, we can assign each node in the tree to
a group, where one node in each group serves to approximate the
frequency of the rest of the nodes in the group. Suppose gr(i) is the
cardinality of the i-th group. Trivially, $\sum_{i} gr(i) = N_T$, where N_T
is the number of nodes in the SJ-Tree.

Therefore, given a query graph Gq and an SJ-Tree T expressing
one possible query decomposition, we can estimate its space
complexity as $S(T) = \sum_{i} gr(i) \, |E(g_i)| \, \mathrm{frequency}(g_i)$. There
is clearly a tradeoff between the accuracy of this estimate and the
computation required to obtain the necessary measurements.
Approximating the space complexity in terms of single-edge subgraphs
is computationally easiest, although it would be a very loose bound
when the frequency of a single-edge subgraph is orders of magnitude
higher than that of the larger subgraphs containing it.
Realistically, we foresee the groups being composed of
unique 1-edge subgraphs, 2-edge subgraphs and triangles (if they exist
in the SJ-Tree), with all larger subgraphs in the SJ-Tree assigned
to these groups and approximated by them.

5.3 Comparison with selectivity agnostic approaches

Our pattern decomposition approach based on relative selectivity
provides a more effective way to look for discriminative patterns than
existing approaches. For example, consider the generic path query
graph in Figure 5(a). A DAG based decomposition approach [7] may
look either for the complete path query or decompose it randomly as
shown in Figure 5(b). As the source vertex (s1) in such a pattern may be
far more frequent than the sink v4, our selectivity based approach will
identify the s2->s3->s4 pattern as being more selective and
start the search from there. This is more efficient than
searching for every pattern starting at s1->s2.


Figure 5: (a) Example path query over vertices v1, v2, v3 and v4,
with S1 < S2 < S3. Si indicates the selectivity of edge ei. (b) A
selectivity agnostic decomposition. (c) Decomposition using our
selectivity based approach.


6. EXPERIMENTAL STUDIES 

We perform experimental analysis on two real-world datasets
(New York Times and Internet Backbone Traffic data) and a
synthetic streaming RDF benchmark. In the interest of space, we include
results for the CAIDA dataset and the RDF benchmark only; the NYTimes
performance is similar to that of CAIDA. The experiments are performed
to answer questions in the following categories.

1. Studying Selectivity Distribution What does the
selectivity distribution of 2-edge subgraphs look like in real
world datasets? For how long does the selectivity distribution or
selectivity order of 2-edge subgraphs remain static?

2. Comparison between Search strategies In the previous
sections, we introduced two different choices for query
decomposition (1-edge vs 2-edge path based) and two different
choices for query execution (lazy vs non-lazy). How do
the strategies compare?

3. Automated strategy selection Given a dynamic graph
and a query graph, can we choose an effective strategy using
their statistics?

New York Times data: http://data.nytimes.com
CAIDA Internet Backbone Traffic data: http://www.caida.org

Comparison with Other Approaches Although other continuous
subgraph query systems exist, their objectives are
different. Both focus on distributed system implementations, and
explore aggregate queries or approximate queries. Also, the types of
graphs they support differ from ours. Our test datasets,
drawn from cyber security and social networks, involve directed
graphs with labeled vertices and edges. We believe that the research
contributions complement each other; hence, we compare our
implementation with a non-incremental approach that performs
subgraph isomorphism for the query graph (using VF2) on every new
edge in the dynamic graph.

6.1 Experimental setup 

The experiments were performed on a 32-core Linux system
with 2.1 GHz AMD Opteron processors and 64 GB of memory.
The code was compiled with the g++ 4.7.2 compiler with -O3
optimization.

Given a pair of data graph and query graph, we perform either of 
two tasks: 1) query decomposition and 2) query processing. 

Query decomposition: Query decomposition involves loading
the data graph, collecting 1-edge and 2-edge subgraph statistics and
performing query decomposition using the selectivity distribution
of the subgraphs. The SJ-Tree generated by the query decomposition
algorithm is stored as an ASCII file on disk.

Query processing: The query processing step begins with loading
the query graph in memory, followed by initialization of the
SJ-Tree structure from the corresponding file generated in the query
decomposition step. We initialize the data graph in memory with
zero edges. Next, edges parsed from the raw data file are streamed
into the data graph. The continuous query algorithm is invoked
after each AddEdge() call to the data graph.

6.2 Data source description 

Summaries of the various datasets used in the experiments are
provided in Table 1. We tested each dataset with a set of randomly
generated queries. The following describes the individual datasets
and test query generation.

Network Traffic The dataset is an internet backbone traffic dataset
obtained from www.caida.org. CAIDA (Cooperative Association
for Internet Data Analysis) is a collaborative program that
provides a wide collection of network traffic data. We used the
“CAIDA Internet Anonymized Traces 2013 Dataset” for
experimentation. The dataset contains 22 million network traffic flow
(subsequently referred to as netflow) records collected over a one
minute period. We excluded the traffic to/from IP addresses matching
the patterns 10.x.x.x or 192.168.x.x. These address spaces refer to
private subnets, and a communication from a given IP address in
these spaces can actually refer to multiple physical hosts in the real
world. As an example, every internet service provider configures the
routers or machines inside a home network with IPs selected from
the private IP address range. Therefore, if we see a request from
192.168.1.1 to google.com, there is no way to determine the exact
origin of this communication. From a graph perspective, allowing
private IP addresses and the subsequent aggregation of communication
would result in the creation of vertices with giant neighbor lists,
which would degrade the search performance. A detailed list of
use cases describing subgraph queries for cyber traffic monitoring
is given in QD.
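
The exclusion of private address ranges described above can be sketched with Python's standard ipaddress module. This is an illustration only: flow records are assumed to be (source IP, destination IP, protocol) tuples, and the names EXCLUDED_NETWORKS, is_excluded and filter_flows are hypothetical. Only the two ranges named in the text are filtered.

```python
import ipaddress

# The two private ranges named in the text: 10.x.x.x and 192.168.x.x.
EXCLUDED_NETWORKS = (
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.168.0.0/16"),
)


def is_excluded(address):
    """True if the address falls inside one of the excluded private ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in EXCLUDED_NETWORKS)


def filter_flows(flows):
    """Drop flows whose source or destination is an excluded address."""
    return [
        flow for flow in flows
        if not (is_excluded(flow[0]) or is_excluded(flow[1]))
    ]
```

Filtering before the edges are streamed into the data graph keeps the aggregated private hosts from ever becoming high degree vertices.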

Social Media Stream Our final test dataset is a synthetic RDF
social media stream available from the Linked Stream Benchmark
(LSBench) jTj. We generated the dataset using the sibgenerator
utility with 1 million users specified as the input parameter. The
generated graph has a static and a streaming component. The static
component refers to the social network with user profiles and
social network relationships. The streaming component includes 3
streams. The GPS stream includes user checkins at various
locations. The Post and Comments stream includes posts and
comments by the users, subscriptions by users to forums, and a stream
of “likes” and “tags”. Finally, the photo stream includes
information about photos uploaded by users, and “tags” and “likes” as
applied to photos.

(b) Internet Backbone Traffic - CAIDA (c) Synthetic social data stream in RDF

Figure 6: Edge type distribution shown with the evolution of the dynamic graph.

Table 1: Summary of test datasets

Dataset                      Type             Vertices    Edges
Internet Backbone Traffic    Network traffic  2,491,915   19,550,863
LSBench/CSPARQL Benchmark    RDF Stream       5,210,099   23,320,426
New York Times               Online News      64,639      157,019

6.3 Selectivity Distribution 

Figure 6 shows the edge distribution plotted over time. The X-axis
shows the cumulative number of edges in the graph as it grows.
The plotted distribution is not cumulative. The edge distribution
is collected at fixed intervals of 10 thousand, 100 thousand and
1 million edges for the New York Times, netflow and LSBench
datasets, respectively. There are 4, 7, and 45
edge types in these datasets. The first half of the RDF dataset
contains data for a simulated social network. The second half contains
simulated data about the activities in the network, such as posts and
checkins at locations. The shift in the edge distribution around the
midpoint reflects these different characteristics. The key observation
is that the relative order of different types of edges stays similar
even as the graph evolves.

There were 14, 62 and 676 unique 2-edge paths present in the
New York Times, netflow and LSBench datasets. Figure 7 shows
the 2-edge path distribution for the LSBench dataset. We found
a small number of 2-edge subgraphs to dominate the distribution
across all the datasets. The other datasets show a similarly skewed
distribution and are omitted for space. The skew is heaviest for the
LSBench dataset, which is expected given the higher number of
unique edge types and the larger size of the dataset.

The goal of this analysis was to observe the variability in the
selectivity distribution over time. The selectivity distribution is
expected to vary over time. However, it is the relative order of the
unique single-edge or 2-edge subgraphs that matters from the query
decomposition perspective. For each of the test datasets, we took
multiple snapshots of the selectivity order and found it to be stable,
except for fluctuations in the very low frequency components
(data points on the left end of the distributions in Fig. 7).
Significant changes in the selectivity order can adversely impact the
performance of the query. Estimating the duration over which the
selectivity ordering stays stable for a given data stream, quantifying
the errors caused by a shift in the distribution, and adapting the
query algorithm to handle such shifts are reserved for future work.

(a) Synthetic social data stream in RDF

Figure 7: 2-edge path distribution in each test data set. Each
point on X-axis represents a unique 2-edge path and Y-axis
shows its corresponding count.

6.4 Query Performance Analysis 

This section presents query performance results obtained through
query sweeps on the network traffic and social network datasets.
We restrict the analysis to these two datasets because of their larger
size. The analysis of the New York Times dataset is made available in
the Appendix in the interest of space. For each query, we collect
performance from 4 different query execution strategies obtained
by 1-edge or 2-edge decomposition of a query graph and the lazy
vs. track-everything approach adopted by the query algorithm. The
following tags are used to describe the plots in the remainder of the
paper: a) “Single”: 1-edge decomposition, search tracks all matching
subgraphs in the SJ-Tree, b) “SingleLazy”: 1-edge based query
decomposition, using the “Lazy” approach to search, c) “Path”: 2-edge
decomposition, search tracks all matching subgraphs in the SJ-Tree, and
d) “PathLazy”: 2-edge decomposition with “Lazy” search.

6.4.1 Network Traffic and LSBench 

We present aggregated results for each query group for LSBench
and CAIDA. Both of these datasets are orders of magnitude larger
than New York Times, and the scale allows us to magnify the
differences between multiple strategies.

Query Generation We generate both path queries and binary
tree queries for the netflow data. Figure 8 shows two decompositions
of an example query. The vertex labels are fixed to type
“ip” and the edge types are randomly chosen from a set of 7
protocols: ICMP, TCP, UDP, IPv6, AH, ESP and GRE. The binary tree
queries were generated following the test generation methodology
described in eg-. The LSBench dataset is tested with path queries
and n-ary trees. A list of valid triples (vertex type, edge type, vertex
type) is generated using the LSBench schema. A tree query is
generated by randomly selecting an edge from the set of valid triples
and then iteratively adding valid new edges from any of the nodes
available. All our query graphs are unlabeled. Using netflow data
as an example, we do not generate a query that has a label associated
with any of the nodes. In practice, we expect users to employ
labeled queries such as finding a tree pattern in the network traffic
where the root of the tree has an IP address (i.e. label) from a certain
subnet. For social data, we may look for paths with specified
user ids (node labels) on the source and the destination nodes of
the path. Here, our experiments are designed to study the impact
of subgraph distributional statistics on query processing.

Figure 8: 1 and 2-edge based decompositions of a path query
on netflow traffic data.

Comparison with others In our previous work 0 we
compared the performance of our implementation with the
IncIsoMatch algorithm proposed by Fan et al. Our IncIsoMatch
implementation was based on a variant of the well-known VF2
algorithm (^.

Summarization of Results All queries of the same type
(path or tree) and size (3-hop length or 5 nodes) are denoted as
a group. We generated 100 queries for each group and then
eliminated those that contained 2-edge paths not seen in the sampled
path distribution. This was done for two reasons. First, inclusion
of an unseen 2-edge path combination makes the query artificially
discriminative. Our goal is to observe query processing time as a
function of varying selectivity, so including unusually discriminative
queries would bias our study. Second, when asked to generate a
path-based decomposition, our SJ-Tree generator resorts to
a single-edge based decomposition when a query subgraph
contains an unseen 2-edge path. This would bias our comparison
between path-based and single-edge based decompositions.
Next, we sampled the “valid” queries by their Expected Selectivity,
computed using the 2-edge path distribution, and reduced each
group to a smaller set of queries that provides a near uniform
sampling of the Expected Selectivity of the larger set. Finally, the
reported runtime for a given strategy (e.g. “PathLazy”) is obtained
by averaging the runtimes over the reduced set of queries.

Figure 9(a)-(d) shows the query processing times collected for both
datasets. The size of the query processing window was fixed at
8 M triples, and the performance statistics were collected at the
middle and at the end of the graph stream. We profiled different
components of the query processing, such as the time spent in
performing subgraph isomorphism and the time spent in updating the
SJ-Tree. The latter is largely composed of the time spent in looking
up the hash tables in various nodes of the SJ-Tree, performing joins
between partial matches and inserting new entries. We found that
the subgraph isomorphism operation (for 1 or 2-edge subgraphs)
dominates the processing time. Considering both classes of queries
with diameter 4 and 5, the subgraph isomorphism operation
consumes more than 95% of the total query processing time.

A general observation is that non-incremental search using VF2
is 10-100x slower. The Y-axis is plotted
in log scale, and we can see how the run times of the “Path”
and “Single” approaches rise exponentially as the query sizes are
increased. Overall, we find that “SingleLazy” and “PathLazy” are
the best performing search approaches. As the tree queries show,
the growth rate in the query processing time is much slower for the
“Lazy” variants. This demonstrates the effectiveness
of restricting the search to where a match is emerging, and growing
the match by starting from the most selective sub-query.

6.5 Analysis via Relative Selectivity 

Figure 10 shows the distribution of relative selectivity for queries
with 4 edges across all three datasets. We picked query graphs with
4 edges to find a common basis for comparing different types of
queries (k-partite vs. path queries) across multiple datasets; the
discussion is equally applicable to larger or different query class
combinations. The top subplot shows the relative selectivity of
10 k-partite queries from the New York Times data. For netflow
and LSBench, we randomly sampled 25 queries from the randomly
generated path query collection. As can be seen, the relative
selectivity is very low for the netflow dataset. Following the definition of
relative selectivity, its value is lowered when the path distribution
based selectivity is low. In other words, there are some paths in the
query which have a very low probability of occurrence. Therefore,
the “PathLazy” approach is superior for such queries. Empirical
observations on larger path queries and other tree queries
suggest two prominent clusters of relative selectivity values. The
first one typically ranges from 0.001 and above, and the second
contains values that are smaller by multiple orders of magnitude.
This suggests a heuristic: the “PathLazy” strategy could
be employed for queries with relative selectivity below 0.001, and
“SingleLazy” for queries above 0.001.

(b) Runtime for Tree Queries on Netflow data. (c) Runtime for Path Queries on LSBench data. (d) Runtime for Tree Queries on LSBench data.


Figure 10: Distribution of Relative Selectivity across queries 
with 4 edges in 3 datasets. Relative selectivity is shown on X- 
axis in log scale. 
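
The strategy selection heuristic suggested in this section can be written as a one-line rule. The sketch below is only an illustration of that rule: the function name choose_strategy is hypothetical, the threshold of 0.001 is the empirical value reported above, and assigning a value exactly equal to the threshold to “SingleLazy” is an assumption, since the text does not cover that case.

```python
def choose_strategy(relative_selectivity, threshold=1e-3):
    """Pick a search strategy from the Relative Selectivity of a query.

    Values below the threshold favor the 2-edge ("PathLazy") decomposition;
    all other values (including the threshold itself, by assumption) fall
    back to the 1-edge ("SingleLazy") decomposition.
    """
    if relative_selectivity < threshold:
        return "PathLazy"
    return "SingleLazy"
```

The threshold is exposed as a parameter because the clusters of relative selectivity values are an empirical observation and may differ for other data streams.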

7. CONCLUSION AND FUTURE WORK 

We present a new subgraph isomorphism algorithm for dynamic
graph search. We analyzed multiple real-world datasets and
discovered that the distribution of 2-edge subgraphs is heavily skewed.
We further demonstrated with a “Lazy” search algorithm that a
query decomposition strategy exploiting this skew will be
consistently efficient. Finally, we concluded with a Relative Selectivity
based rule for selecting a search strategy.

The problem of continuous pattern detection is an emerging area,
and there is an open field to explore. While our 2-edge subgraph
based approach provides an initial foundation, deeper investigations
are warranted for more accurate selectivity estimation.
Subsequent research can leverage the significant body of work on
counting larger subgraphs such as triangles in streaming or
semi-streaming scenarios to obtain quantitative estimates of the space
complexity of a given query decomposition. Adaptive query processing
is an important follow-up problem as well. A long standing
database query needs to be robust against shifts in the data
characteristics. While we propose a fast algorithm for periodic
recomputation of the primitive distribution, we do not address the issues of
modeling the inefficiency from operating under a different selectivity
order and migrating existing partial matches from one SJ-Tree
to another.

APPENDIX

A. ANALYSIS OF DYNAMIC GRAPH SEARCH 
ALGORITHM 

At this point, it is probably obvious that different SJ-Tree
structures can be generated from the same query graph (Figure 8). While
multiple factors can lead to the generation of different SJ-Trees, one
primary factor is our choice of the granularity of decomposition, i.e.
the size and the structure of the subgraphs into which we decompose
the query. Henceforth, we often refer to this set of small subgraphs as
search primitives or simply primitives. As a first step to understand the
speed-memory tradeoff associated with different choices of
primitives, we begin with the complexity analysis of the dynamic graph
search described in Algorithms 1 and 2. A key operation in
Algorithm 1 is the process of subgraph isomorphism around every new
edge in the graph. Therefore, we exclusively focus on the complexity
analysis in terms of 1-3 edge subgraphs as candidates for search
primitives.

Single Edge Subgraphs When the query graph (Algorithm 1,
line 5) contains a single edge, checking whether an edge from
the data graph (e_d) matches the query edge requires comparing the
types and potentially other attributes of the edges. Depending on
the query constraint, we may need to look up the node label to
perform a string comparison or evaluate a regular expression. The
node labels and any other node-specific properties are stored in an
array, leading to constant time access to node labels. Therefore, a
single-edge query can be matched in O(1) time.

Triads Assume that the query graph is a triad with three vertices
$v_1$, $v_2$ and $v_3$, and edges ordered as $e_1 = (v_1, v_2)$,
$e_2 = (v_2, v_3)$, $e_3 = (v_3, v_1)$. For any edge $e$ in the data
graph, we can detect a match with $e_1$ in constant time. If $e$ is
matched, we search the neighborhood of the vertex that matches $v_2$
to find a match for $e_2$. Denoting this vertex as $v_2$, the cost of
this second level of search is $O(\mathit{degree}(v_2))$. In the case
of a 3-edge subgraph, each successful second-level search proceeds to
find a match for the third edge. Thus, the cost of a 2-edge subgraph
is $O(\mathit{degree}(v_2))$ and that of a 3-edge subgraph is
$O(\mathit{degree}(v_2) \cdot \mathit{degree}(v_3))$. We can refine
these estimates to obtain an average search cost of $O(d_2)$ for a
2-edge subgraph and $O(d_2 d_3)$ for a 3-edge subgraph, where $d_2$
and $d_3$ are the average degrees of the vertices in the graph that
have the types of $v_2$ and $v_3$.
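
The two-level neighborhood scan can be illustrated with a short Python
sketch. It is a simplification under stated assumptions: the graph is a
dictionary of directed adjacency lists, vertex and edge types are
ignored, and the function name `find_triads` and the sample edges are
hypothetical. The `inspections` counter is included only to show how
the work grows with the degrees of the vertices matched to $v_2$ and
$v_3$.

```python
from collections import defaultdict


def find_triads(adj, new_edge):
    """Find directed triangles u -> v -> w -> u that start with new_edge.

    Returns (matches, inspections), where inspections counts every
    neighbor examined at the second and third search levels.
    """
    u, v = new_edge                  # new_edge is assumed to match e1
    matches, inspections = [], 0
    for w in adj[v]:                 # second level: neighborhood of v2's match
        inspections += 1
        for x in adj[w]:             # third level: neighborhood of v3's match
            inspections += 1
            if x == u:               # e3 closes the triad back at v1's match
                matches.append((u, v, w))
    return matches, inspections


# Hypothetical data graph.
adj = defaultdict(list)
for src, dst in [(1, 2), (2, 3), (2, 4), (3, 1), (3, 5), (4, 6)]:
    adj[src].append(dst)

matches, inspections = find_triads(adj, (1, 2))
print(matches)       # [(1, 2, 3)]
print(inspections)   # 5
```

Stopping after the outer loop gives the 2-edge case, whose work is
bounded by the degree of the vertex matched to $v_2$ alone.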

The next step is to estimate the cost of the SJ-Tree update operation
(Algorithm [?]). We begin with the hash-join operation (Algorithm [?],
line 7). Assume the frequencies of the subgraphs $g_1^q$ and $g_2^q$
are $n_1$ and $n_2$, where the frequency of a subgraph is defined as
the count of its instances over an edge stream of length $N$. Over $N$
edges, we can therefore expect $O(n_1)$ matches for $g_1^q$ and
$O(n_2)$ matches for $g_2^q$. Consequently, $H_2$ (the hash table
associated with the SJ-Tree node representing $g_2^q$) will be probed
for a match $O(n_1)$ times over $N$ edges, and $H_1$ (associated with
the SJ-Tree node representing $g_1^q$) will be probed $O(n_2)$ times
within the same period.
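
The probe counts above can be reproduced with a small symmetric
hash-join sketch in Python. This is an illustration under assumptions,
not the paper's update algorithm: the class `HashJoinNode`, its
`insert` method, and the use of a single opaque join key are all
hypothetical. Each arriving match probes the sibling's table once and
is then stored in its own table.

```python
from collections import defaultdict


class HashJoinNode:
    """Two hash tables (H1, H2) keyed by a join key, with probe counters."""

    def __init__(self):
        self.tables = (defaultdict(list), defaultdict(list))  # H1, H2
        self.probes = [0, 0]   # probes[i] = number of lookups into tables[i]

    def insert(self, side, key, match):
        """Add a match for g1 (side 0) or g2 (side 1); return joined pairs."""
        other = 1 - side
        self.probes[other] += 1                       # probe the sibling table
        partners = self.tables[other].get(key, [])
        self.tables[side][key].append(match)          # store for later probes
        if side == 0:
            return [(match, p) for p in partners]
        return [(p, match) for p in partners]


# Hypothetical stream: three matches for g1 (n1 = 3), two for g2 (n2 = 2).
node = HashJoinNode()
joined = []
for side, key, match in [(0, "a", "m1"), (1, "a", "m2"), (0, "b", "m3"),
                         (0, "a", "m4"), (1, "c", "m5")]:
    joined += node.insert(side, key, match)

print(node.probes)   # [2, 3]
print(joined)        # [('m1', 'm2'), ('m4', 'm2')]
```

In this run H2 is probed three times (once per match of the first
subgraph) and H1 twice (once per match of the second), mirroring the
$O(n_1)$ and $O(n_2)$ probe counts.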

If we knew the frequency of $G^q$, henceforth denoted $n(G^q)$, we
could also estimate the number of new subgraphs produced by the
hash-joins. Given that the frequency of the larger subgraph cannot
exceed that of its more selective component, we can approximate
$O(n(G^q)) \approx \min(O(n_1), O(n_2))$. Therefore, the average work
for every incoming edge in the graph can be expressed as

\[ \frac{f_s(g_1^q) + f_s(g_2^q) + O(n_1) + O(n_2) + \min(O(n_1), O(n_2))}{N}. \]
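
To make the expression concrete, the following worked instance plugs in
purely hypothetical values (they are not measurements from the paper)
and treats each $O(\cdot)$ term as its argument with a unit constant.
Here $f_s$ is read as the leaf-level search cost accumulated over the
stream.

```latex
% Hypothetical values: N = 1000, f_s(g_1^q) = f_s(g_2^q) = 1000,
% n_1 = 200, n_2 = 50.
\[
\frac{1000 + 1000 + 200 + 50 + \min(200, 50)}{1000}
  = \frac{2300}{1000} = 2.3
\]
```

Under these assumed numbers the two leaf-level search terms account
for 2000 of the 2300 work units, so the choice of search primitives
would dominate the per-edge cost.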

The hash-join combined with leaf-level searches provides the simplest
example of an SJ-Tree: a binary tree of height 1. In this section, we
analyze the time complexity of query processing as it happens in a
multi-level SJ-Tree. Given any non-leaf node $n$, we can obtain the
expression for average work by adapting the complexity expression
shown above. Note that if a child of $n$, denoted by $n_c$, is not a
leaf but an internal node, then the term corresponding to the search
cost ($f_s(g)$) disappears. In its place, we substitute the average
work incurred by the subtree rooted at $n_c$. Therefore, given an
SJ-Tree $T_{sj}$, the average work $C(T_{sj})$ can be obtained by
recursive computation from the root:
$C(T_{sj}) = C(\mathit{root}(T_{sj}))$.
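
The recursion can be sketched in Python as follows. This is one
possible reading of the analysis, not the paper's code: the `SJNode`
class and the function names are hypothetical, every internal node is
assumed to have exactly two children, each $O(\cdot)$ term is replaced
by its argument, and the frequency of an internal node is taken as the
minimum of its children's frequencies, following the approximation
given earlier. Work is accumulated over the whole stream and divided
by the stream length once, at the root.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SJNode:
    """Hypothetical SJ-Tree node used only for this cost sketch."""
    n: int = 0                        # leaf: match count over the stream
    search_cost: int = 0              # leaf: search cost f_s(g) over the stream
    left: Optional["SJNode"] = None
    right: Optional["SJNode"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def frequency(node: SJNode) -> int:
    """Leaf frequency, or the minimum over the children for internal nodes."""
    if node.is_leaf():
        return node.n
    return min(frequency(node.left), frequency(node.right))


def total_work(node: SJNode) -> int:
    """Work over the whole stream for the subtree rooted at node."""
    if node.is_leaf():
        return node.search_cost
    n1, n2 = frequency(node.left), frequency(node.right)
    join_cost = n1 + n2 + min(n1, n2)   # two probe terms plus join outputs
    return total_work(node.left) + total_work(node.right) + join_cost


def average_work(root: SJNode, stream_length: int) -> float:
    """Average work per incoming edge for the whole SJ-Tree."""
    return total_work(root) / stream_length


# Hypothetical three-leaf tree: ((a JOIN b) JOIN c).
a = SJNode(n=200, search_cost=1000)
b = SJNode(n=50, search_cost=1000)
c = SJNode(n=20, search_cost=1000)
root = SJNode(left=SJNode(left=a, right=b), right=c)

print(total_work(root))            # 3390
print(average_work(root, 1000))    # 3.39
```

Swapping the leaves or regrouping the joins changes the intermediate
frequencies and therefore the total, which is how such a function
could be used to compare alternative decompositions of one query.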


